The impacts of transfer learning, phylogenetic distance, and sample size on big-data bioacoustics

Vocalizations in animals, particularly birds, are critically important behaviors that influence their reproductive fitness, but automatically extracting vocalization data from existing large databases has only recently gained traction and has yet to be evaluated with respect to accuracy of different approaches. Here, we use a recently published machine learning framework to extract syllables from six bird species ranging in their phylogenetic relatedness from 1–85 million years, comparing how phylogenetic relatedness impacts accuracy as well as the utility of applying trained models to novel species. Model performance is best on conspecifics, with accuracy progressively decreasing as phylogenetic distance increases between taxa; however, using models trained on multiple distantly related species can recover the lost accuracy. When planning big-data bioacoustics studies, care must be taken in sample design to maximize sample size and minimize human labor without sacrificing accuracy.


Avian bioacoustics
Vocalization is an important form of communication across many taxa. Birds (class Aves) make and respond to vocalizations profusely, for example in territory defense [1], mate choice [2,3,4], both conspecific and heterospecific communication [5,6], species discrimination [7,8], prey detection [9], and food solicitation via begging calls [10,11]. Bird vocalizations are generally classified into songs and calls. Significant attention has been paid to song vocalizations, which are thought to mediate sexual selection [12,13,14] and be important in the generation of reproductive isolation in many bird taxa [15,16,17,18]. Though less prominent, bird calls are also important as they can mediate non-sexual behaviors [19], for instance between parents and offspring [20] or even intra-specific communication [21]. Despite evidence that both the biotic and abiotic environment can impact the development and evolution of vocalization, there are few studies that examine how the community soundscape evolves in tandem [but see 22,23].
The sheer volume of recording data available to modern researchers represents a tremendous asset to avian bioacoustic research. For example, the citizen-science initiative Xeno-Canto (xeno-canto.org) has ~700,000 recordings and the Macaulay Library bioacoustics repository (which includes eBird, [24]) has ~1.1 million recordings (birds.cornell.edu/MacaulayLibrary). Even the smaller bioacoustic collections, which represent important repositories of historical recordings, can have tens of thousands of songs to parse (e.g., the Borror Laboratory of Bioacoustics with ~36,000, blb.osu.edu). This volume of data represents a methodological challenge because most bioacoustic workflows require the songs to be segmented [11], either identifying them from background sounds or, for more complex analyses, identifying individual syllables of song [e.g., 25,26].
Ideally, automated methods would be useful if they could reduce the amount of human interaction without compromising data quality, as song segmentation is extremely time consuming (e.g., [27] found manual annotation to take nearly five times as long, even after accounting for needing to check automatic annotations), though less so than the gold-standard methods of performing playback experiments to directly evaluate behavioral responses to different vocalization treatments (e.g., [28] performed ~4,600 minutes of playback experiments to investigate a single species).

Machine Learning
Machine learning, which can be either unsupervised or supervised, is a useful tool for data processing. Unsupervised machine learning is typically used to cluster data [29,30,31], while supervised learning is frequently used to make predictions [32]. Many well-known algorithms in ecology and evolution are derived from machine learning models in some capacity, for example linear regression, k-means clustering [33], and MaxEnt niche modeling [34,35].
However, most of these techniques require inputs that are specifically curated, such that the presence or absence of features and variables must be determined beforehand, which can lead to a slew of problems.
More recently, deep learning via artificial neural networks (ANNs) has become popular.
ANNs are made from many layers of nodes which construct weighted mathematical functions to perform tasks, typically from input data with diagnostic features [36]. In deep learning methods, the features used do not need to be specified in advance. Because ANNs can iteratively learn their own features, use large amounts of data, and are less constrained by assumptions about that data, they are extremely flexible and can handle many kinds of tasks [e.g., 36, 37, 38, 39, 40, 41, 42]. These methods have been used on a variety of topics including image processing, video segmentation, and speech recognition [43,44,45]. Although challenges remain with respect to scalability, computational efficiency, and how to handle depauperate data [42], deep learning is one of the most powerful analytical tools in the modern researcher's toolbox, particularly when human knowledge is lacking, or datasets are too large to be workable by traditional means. In the context of ecology and evolutionary biology, there have been many recent applications of both shallow and deep machine learning, including population genetics and phylogeography [e.g., 46,47], bioacoustics [e.g., 48,49,50], species classification [e.g., 51], phylogenetics [e.g., 52,53], sequencing and genomics [e.g., 54,55], and phenotypic analyses and morphometrics [e.g., 56,57]. Neural networks and support vector machines tend to be the most commonly applied algorithms in these analyses.
Though there are many ways to categorize machine learning algorithms, a useful one in biology is to categorize evaluative vs extractive algorithms. Evaluative algorithms are typically used to make predictions from data, for example by training a model to distinguish between simulated evolutionary scenarios [46,47,58,59,60]. Extractive algorithms, on the other hand, are designed to process data in some way. With respect to bioacoustics, an extractive algorithm could, for example, segment out syllables within vocalizations. A recently developed application, TweetyNet, was released to perform just this task [50,85,86] using deep learning via ANNs. Specifically, TweetyNet uses convolutional and recurrent ANNs. As such, it is one extractive algorithm that could be useful in reducing the human workload in bioacoustics. However, though potentially much faster than hand-processing data, ANNs like TweetyNet can still take prohibitively long times to train, though they often only need to be trained once.
Training time can be minimized by co-opting existing trained ANNs to perform similar tasks, an approach known as transfer learning. This can be used to take a neural network trained on one batch of data and then apply it to new, related data [42,61]. For example, a neural network that is trained to identify the leaves of oak trees from images may be able to be co-opted to identify leaves of maple trees from images without having to re-train the model to recognize a generic leaf shape. As an example which is more applicable for bioacoustics, a neural network that is trained to identify the song of one species against background noise may be able to be applied to a different species.
Here we compare different treatments of bioacoustic data to optimize the overall performance of machine learning models trained to segment syllables of avian song. We investigate the utility of these methods across a subset of bird species, including birds from different clades ranging in divergence times from ~1 to ~85 million years. We use those taxa to evaluate how well models perform on closely vs distantly related taxa for use in maximizing computational efficiency. Finally, we provide suggested practices for moving forward with neural networks in avian bioacoustics.

Materials and Methods:
Unless otherwise noted, all analyses were performed using a MacBook Pro running macOS Catalina (10.15.7) with a 2.8 GHz i7 core and 16 GB of RAM. Data used will be archived on Dryad. Code is available on GitHub at github.com/kaiyaprovost/bioacoustics.

Acquiring and filtering song data
We downloaded bioacoustics data from two song databases: Xeno-Canto (xeno-canto.org), and the Borror Laboratory of Bioacoustics (blb.osu.edu). We used the Xeno-Canto database as it has a large amount of data of varying quality, due to it being a citizen science initiative, and a convenient R package exists for accessing and filtering the files. The Borror Lab of Bioacoustics database was used because it is full of historical data and based in North America, the region we wished to focus on, in addition to being the lab where the authors are based (Fig 1). For the Xeno-Canto songs, we targeted songs from Cardinalis sinuatus, Cardinalis cardinalis, Zonotrichia leucophrys, Empidonax virescens, and Calypte anna. We chose these taxa because they were distributed across the phylogenetic tree and had relatively high numbers of recordings available. We also targeted songs from Melozone fusca, which we used to evaluate the performance on a novel species. We restricted the data downloaded to all those in North America that were of Quality "A" according to the Xeno-Canto database using the package warbleR version 1.1.26 [62] in R version 4.0.3 [63]. Xeno-Canto recordings are available in the MP3 format, but since our downstream pipeline requires WAV format, we converted them using the package tuneR version 1.3.3 [64]. For the Borror Laboratory of Bioacoustics songs, we downloaded all available data from all species, which were already available in WAV format.
Stereo recordings were converted to mono using the 'mono' function in tuneR. We then retained recordings that had a sample rate of 48,000 Hz, which was the most common sample rate in our dataset, because downstream analyses require recordings of the same sample rate and will not work on stereo recordings.
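The two filtering steps above can be illustrated with a minimal sketch. This is not the authors' R code: the `Recording` container, the `prepare` helper, and the choice to downmix by averaging channels are assumptions made for illustration only (tuneR's 'mono' function offers several downmix options, and the text does not state which was used).

```python
from dataclasses import dataclass
from typing import List, Sequence

TARGET_RATE_HZ = 48_000  # most common sample rate reported in the text


@dataclass
class Recording:
    """Hypothetical in-memory recording: one list of samples per channel."""
    name: str
    sample_rate: int
    channels: List[List[float]]


def to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Downmix by averaging channels sample by sample (an assumed choice)."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def prepare(recordings: Sequence[Recording]) -> List[Recording]:
    """Convert every recording to mono, then keep only those at 48,000 Hz."""
    mono = [Recording(r.name, r.sample_rate, [to_mono(r.channels)])
            for r in recordings]
    return [r for r in mono if r.sample_rate == TARGET_RATE_HZ]
```

Dropping recordings at other sample rates, rather than resampling them, mirrors the retention step described in the text.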

Identification of syllables via manual and automatic annotation of song data
We arbitrarily chose a subset of recordings to annotate from each species, starting with Zonotrichia leucophrys which had the largest dataset, and moving down to Calypte anna which had the smallest dataset. As few as three minutes of recording is sufficient for the TweetyNet algorithm to perform accurately [50], so we ensured that all species had at least 180 seconds worth of annotated syllables. We then added more recordings for species with more data to investigate the influence of sample size. For Melozone fusca, we deliberately annotated (and trained on) a disproportionately small number of songs to generate artificial scarcity of data. In both the Borror Lab of Bioacoustics and the Xeno-Canto database, many species have a small number of recordings. For the former, which primarily sampled North American taxa, nearly 9,400 bird species lack any data, with an additional ~400 species with fewer than 10 recordings.
For the latter, ~600 species lack any data, with an additional ~2,500 with fewer than 10 recordings (Fig 2). As such, having methods which can handle species with small numbers of recordings is critical.
Fig 2 (caption fragment): The x-axis has a gap to show an exceedingly high number of species with no data, as BLB is North America focused.
Annotations were performed in Raven Pro version 1.6.1 using a spectral window size (i.e., Fast Fourier Transform size) of 512 samples, which for a sample rate of 48,000 Hz was ~9.97 milliseconds. We drew boxes representing frequency and time boundaries for every syllable in the recording of the focal species. These annotations were exported as text files using Raven's built-in selection format. After annotating the songs, we then subsetted the recordings such that individual recordings had approximately equal amounts of annotated song and silence.
Preliminary results showed that when TweetyNet was trained on full files, which had a high proportion of non-annotated "silence", TweetyNet would fail to identify the songs and would only predict silence, but with high accuracy. As such, we wrote a script using tuneR in R which automatically sliced WAV files and their corresponding annotation files for training. Annotations that were separated by at least 1.0 second of non-annotated sound were split. Splits were made such that at least 50% of the resulting file was annotated: in cases where this was not possible, a larger percentage of the file may have been annotated. We then converted the annotation files into XML files that TweetyNet could use, in the "Koumura" format [65]. All annotations were given the label "1".
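One way the slicing rule could work is sketched below. This is not the authors' tuneR script: the function names, the symmetric padding, and the rule that a slice may not extend into a neighboring group are all assumptions. Annotations are (onset, offset) pairs in seconds and are assumed not to overlap.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds


def group_annotations(annots: List[Interval],
                      min_gap: float = 1.0) -> List[List[Interval]]:
    """Split annotations wherever >= min_gap seconds of unannotated sound occurs."""
    groups: List[List[Interval]] = []
    for onset, offset in sorted(annots):
        if groups and onset - groups[-1][-1][1] < min_gap:
            groups[-1].append((onset, offset))
        else:
            groups.append([(onset, offset)])
    return groups


def slice_windows(annots: List[Interval], file_dur: float,
                  min_gap: float = 1.0) -> List[Interval]:
    """Return one (start, end) slice per group, padded with silence so that
    roughly half of each slice is annotated (more if padding is clipped)."""
    groups = group_annotations(annots, min_gap)
    windows = []
    for i, group in enumerate(groups):
        first_on, last_off = group[0][0], group[-1][1]
        annotated = sum(off - on for on, off in group)
        span = last_off - first_on
        # Pad symmetrically up to a total length of twice the annotated time.
        # Not handled here: groups whose internal gaps already exceed the
        # annotated time, which would fall below 50% annotated.
        pad = max(0.0, (2 * annotated - span) / 2)
        lower = groups[i - 1][-1][1] if i > 0 else 0.0
        upper = groups[i + 1][0][0] if i + 1 < len(groups) else file_dur
        windows.append((max(lower, first_on - pad), min(upper, last_off + pad)))
    return windows
```

When padding is clipped by the file boundary or by a neighboring group, the slice ends up more than 50% annotated, which matches the exception noted in the text.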

Training of machine learning model using TweetyNet
When training TweetyNet, we set our spectrogram parameters as follows: Fast Fourier Transform size of 512, step size of 32 samples, frequency cutoffs of 500 and 10,000 Hz, a log transformed spectrogram, and a minimum power log threshold of 6.25. We also used a window size of 88 spectral windows and normalized spectrograms while training. Unless otherwise noted, we used an 80:10:10 split for training, validation, and test data, and tracked the total training time for each model. These splits were generated manually for each species. Each model used a batch size of three and was validated every 50 steps. Checkpoints were taken every 200 steps. We ran for 10 epochs (i.e., 10 passes the machine learning algorithm made over the full training dataset) but stopped training prematurely if the model had gone 50 checkpoint steps without an improvement in accuracy. We then used the maximum accuracy checkpoint as our final trained model. Some models were restarted if they were killed before completing 50 checkpoints without improvement. Models that failed this process multiple times were restarted and run as needed. Models were run sequentially (i.e., one at a time).
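The stopping rule described above can be sketched as a simple patience counter. This is an illustration of the general early-stopping technique, not TweetyNet's own implementation; the function name `early_stopping` and its inputs are hypothetical, and the list of accuracies stands in for the values recorded at successive validation checks.

```python
from typing import Sequence, Tuple


def early_stopping(val_accuracies: Sequence[float],
                   patience: int) -> Tuple[int, float, int]:
    """Scan accuracies in order and stop after `patience` consecutive
    checks without improvement.

    Returns (index of best check, best accuracy, index where training stopped).
    """
    best_idx, best_acc, stale = -1, float("-inf"), 0
    stopped_at = -1
    for idx, acc in enumerate(val_accuracies):
        stopped_at = idx
        if acc > best_acc:
            # Strict improvement resets the counter; ties count as stale.
            best_idx, best_acc, stale = idx, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx, best_acc, stopped_at
```

Returning the best index rather than the last one reflects the use of the maximum-accuracy checkpoint as the final model; setting `patience` to 5 or 50 would correspond to the two stopping thresholds compared in this study.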
We trained three sets of models: single-species, multi-species, and transfer learning. For the single-species models, we trained a model in TweetyNet on a given species to 5-step and 50-step checkpoints. We did this once for each of our five species with sufficient data, resulting in ten total single-species models. For the multi-species models, we trained a model in TweetyNet as above but using a combination of the five species with sufficient available data. We did this for two models: an unbalanced and a balanced model. For the unbalanced model, we included all training data from each of the five species irrespective of the sample sizes for each. For the balanced model, we subset the training and validation data used such that each species had the same amount of training and validation data (in seconds) as our species with the smallest sample size, to mitigate any effects of biased sampling. Each was trained to 5-step and 50-step checkpoints, resulting in four total multi-species models.
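The balancing step can be sketched as follows. The text does not state how individual clips were chosen, so the greedy selection here is an assumption, as are the function name and the data layout (a mapping from species to a list of clip name and duration pairs).

```python
from typing import Dict, List, Tuple

Clip = Tuple[str, float]  # (clip name, duration in seconds)


def balance_by_duration(
    clips_by_species: Dict[str, List[Clip]],
) -> Tuple[float, Dict[str, List[str]]]:
    """Cap every species at the total duration of the least-sampled species.

    Clips are taken greedily in the order given; a clip is skipped if it
    would push the species over the budget, so totals are at most (and
    only approximately equal to) the budget.
    """
    totals = {sp: sum(dur for _, dur in clips)
              for sp, clips in clips_by_species.items()}
    budget = min(totals.values())
    balanced: Dict[str, List[str]] = {}
    for sp, clips in clips_by_species.items():
        chosen, used = [], 0.0
        for name, dur in clips:
            if used + dur <= budget:
                chosen.append(name)
                used += dur
        balanced[sp] = chosen
    return budget, balanced
```

Balancing by seconds of audio rather than by number of files follows the description above, since recordings vary widely in length.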
To incorporate transfer learning for use in the transfer models, we wrote a custom module for TweetyNet ("train_checkpoint") which was able to continue training from an existing checkpoint file from a previously completed trained model using new training data from a different taxon. At the time of writing, this functionality has been planned but not executed in TweetyNet's source code. After training both multi-species models as above, we then used that finished checkpoint to train on data from Melozone fusca, which we manipulated to use a relatively small amount of data (aka "few-shot learning" [66]). Instead of training for 50 checkpoints which would likely lead to overfitting, we trained for 5 checkpoints only. Though we had a comparable amount of data for this species compared to the other five, we generated artificial scarcity by only using ~30% of this as training and validation data, with the remainder acting as test data. This resulted in four total transfer-learning models depending on whether we trained from the balanced or unbalanced multi-species model, and whether we trained from the 5-step or 50-step checkpoint of those models.
We tested the single-species and multi-species models on annotated data from each of the five species we trained on with sufficient available data (Zonotrichia, Cardinalis, Empidonax, and Calypte species; either zero-shot or normal training) as well as the species with artificially insufficient available data (Melozone fusca; zero-shot training). For the transfer models, we only tested on Melozone fusca (few-shot training). In all cases, we did this testing to assess the accuracy of the model. We evaluated accuracy across species with respect to the predicted genetic distance. In addition, we evaluated the performance with respect to different transfer methods, the number of species present in the training dataset, the balancing approach for multi-species models, and the impact of zero-shot vs few-shot vs normal training.
For some analyses we examined the phylogenetic distance between trained species and test species. We mined these data from timetree.org [67], taking the estimated time listed as the time of divergence in millions of years (Mya). Intraspecific comparisons were given an estimated time of 0.0 Mya; we also did this for the transfer learning done with Melozone fusca. For the multiple species models, we calculated the weighted average estimated divergence time based on the sample sizes for each species, again taking conspecifics as 0.0 Mya.
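The weighted average for multi-species models can be written out as a short sketch. The function name and data layout are hypothetical, and any numbers passed to it in an example would be placeholders rather than values from TimeTree or from this study.

```python
from typing import Dict, FrozenSet


def weighted_divergence(train_seconds: Dict[str, float],
                        divergence_mya: Dict[FrozenSet[str], float],
                        test_species: str) -> float:
    """Sample-size-weighted mean divergence time between a test species
    and every species in the training set.

    Conspecific comparisons contribute 0.0 Mya, as described in the text.
    `divergence_mya` maps an unordered species pair to a divergence time.
    """
    total = sum(train_seconds.values())
    weighted_sum = 0.0
    for species, seconds in train_seconds.items():
        if species == test_species:
            time_mya = 0.0
        else:
            time_mya = divergence_mya[frozenset((species, test_species))]
        weighted_sum += seconds * time_mya
    return weighted_sum / total
```

Because conspecific data contribute zero, a test species that dominates the training set pulls the weighted divergence time toward 0.0 Mya.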

Song diversity
We evaluated song diversity to determine whether this impacted our machine learning model accuracy. Song diversity was evaluated in two ways: first, we consulted an analysis that focused on Passeriformes which measured song complexity [68]. Three of the species had complexity analyzed in that study (Zonotrichia leucophrys, Cardinalis cardinalis, Cardinalis sinuatus) which were calculated by performing a PCA on song properties (e.g., number of notes, frequency bandwidth). We directly used those values as our complexity measure. Our second metric of song diversity was done by estimating the hypervolume complexity. To do this, we took the 5 longest WAV files for each species that had been manually annotated. We then put them through the SoundShape pipeline [69]. SoundShape aligns and normalizes syllables of sound to make them directly comparable with respect to their size and shape. After aligned and normalized syllables were converted to TPS format, we randomly selected 50 syllables per species and calculated a principal components analysis on their values, retaining the first three principal components (PCs). We then used the three PCs to calculate a hypervolume per species, as a percentage of the total hypervolume occupied by all species. We repeated this 50 times and then calculated the mean percentage hypervolume occupied per species as our complexity metric. Species with more diverse syllable types, and therefore more song diversity, should have larger hypervolumes.

Impact of various metrics on accuracy and time of training
To investigate what aspects of our models determined accuracy and training time, we used generalized linear regressions as well as ANOVA and Tukey's Honest Significant Differences tests [70,71,72] using the stats package in R. For all models, we took the log values of training time and sample size, as these varied by orders of magnitude across taxa and models.
Unless otherwise specified, we set our alpha level to 0.05 to determine significance, and we applied linear models for 5-step and 50-step checkpoint runs independently.
We tested the relationship between the size of the training dataset and the amount of training time using a linear regression. We did this both with and without transfer learning models. We also tested the relationship between accuracy and phylogenetic distance, accuracy and song complexity, and/or accuracy and hypervolume diversity as linear regressions both for single-species models only and for all models. The linear regressions involving song complexity and hypervolume diversity were run calculating these metrics for the testing species as well as the training species. Finally, we used an ANOVA to determine whether individual species differed in accuracy as well.
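The regression of training time on training-set size, with both variables log-transformed, can be sketched in plain Python as an ordinary least-squares fit with adjusted R². The analyses in this study were run with the stats package in R; the function below is only an illustrative equivalent, and the use of the natural logarithm is an assumption (the base is not stated).

```python
import math
from typing import Sequence, Tuple


def ols_log_log(x: Sequence[float],
                y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Fit log(y) = intercept + slope * log(x) by ordinary least squares.

    Returns (slope, intercept, R squared, adjusted R squared) for a model
    with a single predictor. Requires at least three points and positive values.
    """
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2
                 for a, b in zip(log_x, log_y))
    ss_tot = sum((b - mean_y) ** 2 for b in log_y)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, r2, adj_r2
```

On the log-log scale, a slope near 1 would indicate that training time grows roughly in proportion to the amount of training data.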

Machine model performance on annotated song
Training time of the single-species models to 5 checkpoints ranged from 310 seconds (~5.2 minutes, Calypte anna, 256 seconds of song) to 1,999 seconds (~33.3 minutes, Zonotrichia leucophrys, 3,878 seconds of song; Table 1). For 50 checkpoints, these times ranged from 4,506 seconds (~75.1 minutes or ~1.3 hours) to 44,694 seconds (~744.9 minutes or ~12.4 hours).
Table note: Accuracy values are given for models checkpointed at 5 steps and 50 steps. Estimated divergence time for "5 Species" models is a weighted average, proportional to the sample sizes used for each species trained on.
Accuracy in the 50-step checkpoint runs was nearly always equal to or higher than accuracy in the 5-step checkpoint runs. Across all 42 pairs of single-species and multiple-species models, 32/42 had higher accuracy in the 50-step checkpoint (difference range=1-30%), 3/42 had the same accuracy, and 7/42 had higher accuracy in the 5-step checkpoint (difference range=1-6%).
For models that were trained on single species, accuracy was generally highest when models were tested on the same species that they were trained on (5-step range=89-93%; 50-step range=91-97%), except for Cardinalis cardinalis in the 5-step checkpoints (in which accuracy was highest for its congener Cardinalis sinuatus, 87% vs 67% for Cardinalis cardinalis) and Calypte anna in both 5-step and 50-step checkpoints (where highest accuracy was also for Cardinalis sinuatus, 62-67% vs 56-60% for Calypte anna).
Accuracy dropped off in a manner that scaled to the predicted phylogenetic distance in the models that were not trained and tested on the same taxa (i.e., zero-shot training), with loss of 0-5% for congeneric comparisons (plus one gain of 20% in Cardinalis cardinalis), 4-32% for taxa with the same oscine/suboscine status (plus one gain of 15% in Cardinalis cardinalis), 2-24% for comparisons between birds in the same order, and 3-30% between passerines and nonpasserines (with gains of 2-11% in Cardinalis cardinalis and Calypte anna). This relationship is significantly negative with respect to estimated phylogenetic distance for both 5-step and 50-step models (adjusted R² range=0.27-0.35, p<0.0002; Fig 3). Most of the multi-species models perform the same or worse at classifying a species than the corresponding single-species model trained on that species (0-8% loss in accuracy across all taxa), except for the Cardinalis cardinalis and Calypte anna models mentioned above (12-20% gain in accuracy) and one Zonotrichia leucophrys 5-step checkpoint model (with a 1% gain).
However, multi-species models tend to perform better than zero-shot learning on other single-species models (improvement of 0-33%) with a few exceptions that are worse (by 1-3%).
Multiple species models perform better than expected for their mean estimated divergence time: of the 24 tests done across the models, 21 are above the line of best fit and three are below (Fig 3).
Comparing the 12 unbalanced and balanced multi-species model pairs, the differences in performance within species ranged from 4% better for balanced models to 9% better in unbalanced models, with 8/12 of those pairs differing by less than 1%. Zonotrichia leucophrys, the most over-represented species, has the unbalanced models performing much better (by 3-9%).
There is no significant association between sample size of either training or test data and average accuracy across single-species models (log sample sizes, p-value range=0.33-0.53), although when multiple-species and transfer learning models are included, there is a slight positive association with test data sample size and accuracy (p-value range=0.009-0.03, adjusted R² range=0.09-0.12). However, Calypte anna vocalizations are significantly less likely to be classified accurately than other species (p-value<0.03). Notably, this is the only non-passerine in the dataset and has a very different type of song than the others.

Zero-shot vs few-shot learning
With respect to the small Melozone fusca dataset, classification of this species' songs is overall highly accurate (Table 3, Fig 4). For the single-species models, the least accurate was that of the distantly related Calypte anna (72-80%). For the other species, the 50-step checkpoint models all perform equivalently (91% accuracy) while in the 5-step checkpoint models, accuracy ranges from 81-91%. For the multi-species models, the unbalanced model (91-94%) performed slightly better than the balanced model (88-93%), and slightly better than the closest related species Zonotrichia leucophrys (88-91%), suggesting that having a model trained on a more diverse subset of sounds improves performance of zero-shot learning. Lastly, the few-shot learning transfer models perform the best of them all with little additional training time (93-95%), suggesting that having a single multi-purpose model rich in data to "seed" the training of species with few data may be beneficial.
Table note: Accuracy values are given for models checkpointed at 5 steps and 50 steps. Estimated divergence time for "5 Species" models is a weighted average, proportional to the sample sizes used for each species trained on.

Song diversity results
Song complexity values (from [68]) ranged such that Cardinalis cardinalis had the lowest complexity, followed by Cardinalis sinuatus, and then Zonotrichia leucophrys with the highest complexity (Table 4). In contrast, our calculated hypervolume diversity metric finds that instead Cardinalis sinuatus has the smallest hypervolume, followed by Empidonax virescens, Melozone fusca, Cardinalis cardinalis, Zonotrichia leucophrys, and finally Calypte anna with the largest hypervolume. The latter two species occupy much higher percentages of hypervolumes compared to the remaining four species. With such a small number of points (N=3 for complexity from Medina and Francis 2012, N=6 for hypervolume estimates) we do not have the power to detect any correlations between the two values, but we expect that complexity and diversity should be correlated.
Neither complexity [68] nor mean hypervolume of the species used for training are significant predictors of accuracy (p>0.11). Likewise, song complexity [68] of the species used for testing is not a significant predictor of accuracy (p>0.45). However, the mean hypervolume of the species used for testing is negatively associated with accuracy, such that the more diverse a species' song is, the less accurate the model is at predicting it (p<0.0002, adjusted R² >0.28).
Table note: Complexity scores range from lower complexity (more negative) to higher complexity (more positive). Hypervolume diversity is a percentage of total hypervolume across all six species, with means and standard deviations from 50 independent runs of 50 random syllables per species.
In terms of performance, training a model on a species and applying it to that same species appears to be the most accurate method of extracting syllables, provided that the sample size is sufficiently high. This is irrespective of whether the training dataset includes only that species or also includes other taxa (e.g., the 5-species models). However, the range in accuracy varies highly across taxa. Species with simpler vocalizations (e.g., Empidonax) appear more likely to be classified with high accuracy than species with more complex vocalizations (e.g., Zonotrichia). We speculate this is due to a combination of factors. For taxa with more complex vocalizations, the detection algorithm must be optimized for multiple different kinds of syllables at once, which is expected to be more difficult than categorizing fewer kinds of syllables. Our models were run under the same parameters and for the same training times, though they were allowed to exit early if they reached an accuracy plateau; given this, it is likely that the convolutional neural network can solve a "simpler" problem in the same amount of time more efficiently. Another consideration, however, is sample size given to the model. Although we did not find an explicit correlation between sample size and accuracy, it is nevertheless possible that our high accuracy on these species is due to over-fitting, rather than any biological aspect of the song. However, we do not attribute these differences to learning vs. non-learning behavior; although Empidonax are the only suboscine (non-learning) Passeriform bird in our dataset, there is not a relationship between song complexity and oscine/suboscine status in Passeriformes [76].
We find that model accuracy drops with phylogenetic distance between the training data and tested data. This suggests that there is commonality between closely related species in terms of their song; indeed, evidence across passerines suggests that song is phylogenetically constrained, in addition to being influenced by morphology and ecology [76,77]. From these results we would predict that any similar learning algorithm should behave in this fashion with traits that are associated with phylogeny. This pattern may also extend to species with convergent song features, though this remains to be tested.
Though the sample size does not appear to impact accuracy, it is an important consideration when trying to decide how to best extract data from species. With sufficiently large sample sizes (TweetyNet suggests 3 minutes of annotation, [50]), techniques like convolutional neural networks will work effectively to learn and parse the diversity of syllables and sounds present in these data. However, for some taxa the available data are not sufficient for these types of learning. In this case, we have two recommendations depending on the study system. If a study only concerns one single species without a large amount of data, it would likely be best to manually annotate all data available, rather than try to fit a computationally expensive model to it. On the other hand, if a study is focused on multiple species at once, especially if some of those species have a large sample size, we recommend training a model on multiple different species, optionally then using transfer learning techniques to fine-tune the model. The caveat with this latter approach is that the benefit of this learning may decrease as phylogenetic distance (and song dissimilarity) increases with respect to your focal taxa.
We find that transfer learning from multiple species models substantially increases accuracy for taxa without a large amount of data. However, this might also be partially due to overfitting of very small datasets. Transfer efficiency seems to be just as good from balanced models as from unbalanced models. We recommend balancing, however, because it appears to perform just as well with significantly less data and thus less training time. In terms of computational efficiency, we find that training a model with multiple different species takes longer than training a model on each species separately. We suspect this is because the algorithm needs to optimize segmentation of many disparate kinds of syllables, much like with species that have more complex songs compared to species with fewer songs. However, we also find that models trained on multiple species perform better than models trained on single species when applied to a never-before-seen taxon, even after considering average phylogenetic distance between training species and testing species. Despite the increase in computation time, having a more general-purpose, multi-species model trained can be worthwhile for researchers looking to maximize the number of taxa they can segment syllables for, while minimizing the amount of manual annotation and training time they need.

Downstream applications and implications
The approach we develop here is fast, flexible, and can be run on a standard laptop, though performance will benefit from parallel processing. As such, our application can supplement the deep avian bioacoustics literature with the breadth afforded by efficient data generation. In addition, our overall framework will likely be applicable to other machine learning algorithms: TweetyNet could be replaced by other convolutional/recurrent neural network methods with these guidelines still being relevant. It remains to be seen whether other methods, such as random forests or evolving neural networks, will perform similarly [78,79,80,81].
Automatic methods can be used to validate and supplement the manual segmentation analyses that are traditional in avian bioacoustics. At the most basic level, this approach can be used with sparser forms of data, for instance continuous recordings in which the species of interest only vocalizes for a short amount of time. Segmentation with this method and the downstream scripts we have developed here can reduce the amount of recording that needs to be searched by quickly eliminating large portions of the files. This segmentation method can also be used to standardize projects among individuals who are working concurrently. Human biases in segmentation are a known phenomenon [82,83,84], but applying a single model as a starting point would potentially alleviate some of those issues. Further, when working from automatically segmented data, the ability to recognize incorrect or otherwise anomalous syllables where models and manual labeling do not match would make data cleaning swifter and more reliable. Lastly, these models can assist in training younger scientists without potentially compromising data quality; with a standard template to work from and the ability for the machine learning algorithm to partially check classifications, students could spend more time developing other research skills than simple song annotation.
Automatic segmentation can be useful not just to reduce human workload, but also to set up downstream applications. First, though not implemented here, TweetyNet can identify not only the presence of syllables but also individual types of syllables; indeed, this is TweetyNet's original goal, and it does so with high accuracy. However, this requires knowledge of what types of syllables exist in the dataset, and in taxa that have not been studied before, it is hard to know a priori what those sounds are. An algorithm that can broadly detect sounds can nevertheless lead into automatic classification of syllables via clustering analyses, which already exist in easily implemented forms (e.g., K-means clustering [33]). It is beyond the scope of this manuscript to discuss clustering methods, but being able to partially (or fully) characterize the repertoire of a new species automatically is within the realm of possibility. Further, any methods that can both segment syllables and classify syllable types allow for the detection of fine variation within those syllable classes, as well as changes in the syntax of bird song (i.e., the arrangement of sounds into specific sequences). This contrasts with changes in other features of song, like frequency, duration, or bandwidth, which are independent of the order of specific syllables. Regardless, the ability to categorize these sounds will be particularly useful in obtaining a more thorough assessment of song complexity within and between taxa.
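To make the clustering step concrete, the following is a minimal, dependency-free sketch of K-means applied to per-syllable features. It is illustrative only and was not run as part of this study: the choice of features (duration in seconds and peak frequency in kHz), the toy values, and the function name are assumptions, and in practice features should be standardized and the number of clusters chosen with care.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain K-means (Lloyd's algorithm) on a list of equal-length tuples.

    Returns (labels, centroids), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(pt, centroids[j]))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep empty clusters in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical segmented syllables: (duration_s, peak_frequency_khz).
syllables = [
    (0.05, 6.0), (0.06, 6.2), (0.04, 5.9),  # short, high-pitched notes
    (0.30, 2.0), (0.28, 2.1), (0.32, 1.9),  # long, low-pitched notes
]
labels, centroids = kmeans(syllables, k=2)
```

Here the two groups of toy syllables fall into separate clusters; with real recordings, richer features (for example, spectrogram-derived embeddings) would likely be needed to separate syllable types.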
Segmentation of sound is a critical part of the analysis of song in birds, as well as in other taxa that communicate with sound, including some insects, frogs, and bats. Given the sheer number of species that this encompasses, a standardized and automatic approach will allow traditional bioacoustic analyses to be performed quickly. Our method forms a starting point for this work.
Given enough time, we foresee that these and other models can be further refined and used to segment all recordings available for all species, a task that would likely comprise multiple lifetimes of work if done manually and would be intractable given the current speed at which new recordings are added. Further, it is not outlandish to think that as machine learning methods become more sophisticated, software could be developed to auto-segment sounds as they are cataloged into these databases and repositories, or even be integrated into the recorders and microphones that are commonly used today.

Conclusions
In this paper, we investigate the influence of phylogeny on machine learning methods to extract syllables from recordings of bird song, finding that the accuracy of these methods depends in part on the phylogenetic relatedness of the taxa being trained on and tested on. However, the loss in accuracy associated with distantly related taxa can be ameliorated by including multiple species. Further, species with smaller sample sizes can be classified with high accuracy by tuning previously trained models. We suggest the following best practices for using machine learning in bioacoustics, depending on the sample sizes of the species of interest. For very small amounts of total data, it is best to hand-annotate syllables to segment them, irrespective of the number of species. With larger amounts of data, hand-annotating select species that span the phylogenetic (and bioacoustic) variation in the dataset and then training a machine learning model to segment syllables from them will reduce computational load without a major disruption to accuracy. This framework will be broadly applicable to many regions of the world and many taxa, allowing us to achieve a global perspective on vocalizations. Ultimately, this innovation will change how biologists analyze songs by expanding the scope of what is possible from depth to breadth.

Fig 1: Flowchart describing the procedure from acquiring song recordings to having

Fig 2: Most species of bird have little bioacoustic data. X-axis shows, on a log scale, the

4 hours). The multiple species models took 4,214 seconds (~70.2 minutes or ~1.2 hours) for the unbalanced (5,553 seconds of song) and 557 seconds (~9.3 minutes) for the balanced (1,130 seconds of song) to reach 5 checkpoints. For 50 checkpoints these numbers were 25,863 seconds (~431 minutes or ~7.2 hours) for unbalanced and 10,348 seconds (~172.4 minutes or ~2.9 hours) for balanced. The four transfer learning models, which were all trained for 5 checkpoints each, took between 246 seconds (~4.1 minutes) and 665 seconds (~11.1 minutes) to train on the Melozone fusca data (71 seconds). Less additional training time was required when training using the 50-step checkpoints than the 5-step checkpoints, though total training time was higher. The relationship between the size of the training dataset and the amount of training time needed is positive on a log scale for both the 5-step and 50-step checkpoint runs; this holds even after excluding transfer learning models, which were only run for 5-step checkpoints (adjusted R² range = 0.51–0.77, p < 0.021;

Fig 3: Model performance by species drops with estimated divergence, but training on

Fig 4: Model performance on Melozone fusca drops with estimated divergence, but also
Our dataset included 6,563 Zonotrichia leucophrys songs, 918 Cardinalis cardinalis songs, 36 Cardinalis sinuatus songs, 128 Empidonax virescens songs, 102 Calypte anna songs, and 72 Melozone fusca songs. We filtered out songs that did not have a sample rate of 48,000 Hz and then annotated 197 Zonotrichia leucophrys songs, 61 Cardinalis cardinalis songs, 8 Cardinalis sinuatus songs, 22 Empidonax virescens songs, 15 Calypte anna songs, and 12 Melozone fusca songs. After subsetting files to balance the relative amount of annotated sound to silence, this resulted in total datasets of 4,843 seconds for Zonotrichia leucophrys, 1,083 seconds for Cardinalis cardinalis, 401 seconds for Cardinalis sinuatus, 291 seconds for Empidonax virescens, 316 seconds for Calypte anna, and 272 seconds for Melozone fusca (see Supplementary Table

Table 1: Models trained in TweetyNet across multiple species vary in training time, dataset size, and type of model. "5 Species" models were trained on all data from the "Single species" type models. "Balanced" models were trained on equal amounts of data from each of the "Single species" type models.